Before we start, some terms/concepts are important to understand.
Data types - Under the hood, R stores information in a few different ways:
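Vectors - A vector in R is an ordered collection of values of a single type, akin to a list or array in other languages. Nearly all data in R is stored in vectors, and elements are accessed by index/position, counting from 1. Most vectors are created using c("thing1", "thing2", …, "thingn").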
The pipe ( %>% ) - The pipe is a special operator that lets us write cleaner code by saying “take the output of the expression on the left and pass it as the first argument to the function on the right.”
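As a small illustration of both ideas, here is a sketch with made-up values: a vector is built with c() and indexed from 1, and the pipe rewrites a nested call as a left-to-right chain. The object names are invented for this example.

```r
library(magrittr)  # supplies %>%; library(tidyverse) also makes it available

# A character vector; R counts positions from 1
fruits <- c("apple", "banana", "cherry")
second_fruit <- fruits[2]

# The same calculation written as a nested call and with the pipe
nested <- round(sqrt(sum(c(1, 4, 9))), 1)
piped <- c(1, 4, 9) %>%
  sum() %>%
  sqrt() %>%
  round(1)

print(second_fruit)
print(identical(nested, piped))
```

The piped version reads in the order the steps happen, which is the main reason the rest of this tutorial uses it.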
Packages provide a lot of the functionality that makes R so useful. I’ve already installed all of the packages necessary to run this tutorial but if you need to install them on your computer, just type
#install.packages("PACKAGE_NAME",dependencies = TRUE)
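If asked whether to compile packages from source, try the “Yes” option first and, if it fails, rerun the command and choose “no”. Packages from Bioconductor are installed differently, through BiocManager::install(); see the Bioconductor website for instructions.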
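If you are not sure which packages are already on your computer, something like the sketch below can check before installing. The helper name missing_packages is made up for this example and is not part of any package; the install line is left commented out so nothing is downloaded by accident.

```r
# Hypothetical helper (not from any package): return the entries of `pkgs`
# that cannot be found in the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "ggrepel"))
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install, dependencies = TRUE)
```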
First we load the core tidyverse package. This package includes:
We also load the ggrepel package that helps format ggplot output.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggrepel)
There are a lot of ways to get data into R. Here we will go over the readr and tibble packages.
We can also manually create a tibble using a series of vectors. You simply specify the name of the column and the data corresponding to it.
OCH_tibble<-tibble::tibble(
names=c("Claire","Cory","Rachael"),
r_abil=c(10,5,8),
height=c(165.1,187.96,158.75)
)
print(OCH_tibble)
## # A tibble: 3 x 3
## names r_abil height
## <chr> <dbl> <dbl>
## 1 Claire 10 165.
## 2 Cory 5 188.
## 3 Rachael 8 159.
We can also load a data set using read_csv. This creates a tibble (the tidyverse version of a data frame). Think of this as one sheet from an Excel file where data is stored in rows and columns. Readr also has functions for reading other file types (e.g. read_tsv for tab-separated files). We can write our data to an output file using the matching write_ functions.
Gap_minder<-readr::read_csv("GM_code_along.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_double(),
## lifeExp = col_double(),
## pop = col_double(),
## gdpPercap = col_double()
## )
print(Gap_minder)
## # A tibble: 1,705 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,695 more rows
readr::write_csv(OCH_tibble,"OCH.csv")
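To see reading and writing together, the sketch below writes a tiny made-up tibble to a temporary file and reads it back. The tibble and file are invented for this example. Supplying col_types is optional, but it fixes the column types up front instead of letting readr guess them.

```r
library(readr)
library(tibble)

# Toy data invented for this example
demo <- tibble(id = c(1, 2, 3), label = c("a", "b", "c"))

# Write to a temporary file, then read the same file back in
path <- tempfile(fileext = ".csv")
write_csv(demo, path)
demo_back <- read_csv(
  path,
  col_types = cols(id = col_double(), label = col_character())
)

print(demo_back)
```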
Lastly, you can load files with RStudio. To do this, simply navigate to the “Files” tab in the bottom right corner, find your file, click on it, and select “Import Dataset.” To save, use one of the write_ functions mentioned above.
Next we will use functions from the dplyr package to manipulate our dataset.
Select can be used to pick columns to include or exclude
# Only include country, continent, year, and lifeExp columns
Gap_minder%>%
dplyr::select(c(country, continent, year, lifeExp))%>%
colnames()
## [1] "country" "continent" "year" "lifeExp"
#Exclude gdpPercap column
Gap_minder%>%
dplyr::select(-gdpPercap)%>%
colnames()
## [1] "country" "continent" "year" "lifeExp" "pop"
#Include columns that contain "co" (i.e. country and continent)
Gap_minder%>%
dplyr::select(dplyr::contains(c("co")))%>%
colnames()
## [1] "country" "continent"
#Reorder columns so year is first
Gap_minder%>%
dplyr::select(year,dplyr::everything())%>%
colnames()
## [1] "year" "country" "continent" "lifeExp" "pop" "gdpPercap"
#Select can be used to rename columns while filtering,
#or rename can be used to rename columns in place
Gap_minder%>%
select(year, selected=country)%>%
colnames()
## [1] "year" "selected"
Gap_minder%>%
rename(renamed=country)%>%
colnames()
## [1] "renamed" "continent" "year" "lifeExp" "pop" "gdpPercap"
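A few other selection helpers work the same way as contains. The sketch below uses a one-row toy tibble with the same column names as Gap_minder (the values are invented) so that it runs without the CSV file.

```r
library(dplyr)

# One-row toy tibble; column names match Gap_minder, values are made up
toy <- tibble::tibble(
  country = "Nowhere", continent = "Atlantis", year = 1992,
  lifeExp = 70, pop = 1000, gdpPercap = 5000
)

# Columns whose names start with "co"
starts_co <- toy %>% select(starts_with("co")) %>% colnames()

# A contiguous range of columns, from year through pop
year_to_pop <- toy %>% select(year:pop) %>% colnames()

# Drop several columns at once
no_text <- toy %>% select(-country, -continent) %>% colnames()

print(starts_co)
print(year_to_pop)
print(no_text)
```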
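Filter is used to include or exclude rows based on a logical condition. Say we just want to compare GDP data for European countries within a certain year. Because this data set records one observation every five years, filtering for years between 1990 and 1994 keeps only 1992.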
Europe_1992<-Gap_minder%>%
dplyr::filter(dplyr::between(year, 1990,1994) & continent=="Europe")
# colour is set in the geom; ggplot() itself silently ignores it
ggplot2::ggplot(Europe_1992,aes(x=lifeExp,y=gdpPercap,fill=continent))+
ggplot2::geom_point(colour="blue")+
ggtitle("European Per Capita GDP 1992")+
ggrepel::geom_text_repel(aes(x=lifeExp,y=gdpPercap,label=country))+
ggplot2::xlim(c(65,80))+
ggplot2::ylim(c(2000,35000))
You may also want to filter to compare two countries
Alb_France<-Gap_minder%>%
filter( (country=="Albania" | country=="France") & year==1992)
# colour is set in the geom; ggplot() itself silently ignores it
ggplot(data=Alb_France,aes(x=lifeExp,y=gdpPercap,fill=continent))+
geom_point(colour="blue")+
ggtitle("France & Albania Per Capita GDP 1992")+
geom_text_repel(aes(x=lifeExp,y=gdpPercap,label=country))+
xlim(c(65,80))+
ylim(c(2000,35000))
Or identify and remove NA values
Gap_minder%>%
filter(is.na(gdpPercap))
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Coryland <NA> 2019 NA 1 NA
Gap_minder<- Gap_minder%>%
filter(!is.na(gdpPercap))
#You could also use filter(country!="Coryland")
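When comparing more than a couple of values, %in% is often tidier than chaining | conditions. The sketch below uses a small made-up tibble (the numbers are invented, not real GDP figures) to show that, along with the NA filter from above.

```r
library(dplyr)

# Toy data invented for this example
toy <- tibble::tibble(
  country   = c("Albania", "France", "Albania", "Chile"),
  year      = c(1992, 1992, 1997, 1992),
  gdpPercap = c(100, 200, NA, 300)
)

# Conditions separated by commas are combined with AND
two_countries <- toy %>%
  filter(country %in% c("Albania", "France"), year == 1992)

# Keep only rows where gdpPercap is not missing
complete_rows <- toy %>%
  filter(!is.na(gdpPercap))

print(two_countries)
print(complete_rows)
```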
Mutate is used to add new columns to a tibble, usually based on calculations involving an existing column.
Gap_minder%>%
mutate(gdptotal=gdpPercap*pop/(10^9))
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap gdptotal
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6.57
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7.59
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8.76
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9.65
## 5 Afghanistan Asia 1972 36.1 13079460 740. 9.68
## 6 Afghanistan Asia 1977 38.4 14880372 786. 11.7
## 7 Afghanistan Asia 1982 39.9 12881816 978. 12.6
## 8 Afghanistan Asia 1987 40.8 13867957 852. 11.8
## 9 Afghanistan Asia 1992 41.7 16317921 649. 10.6
## 10 Afghanistan Asia 1997 41.8 22227415 635. 14.1
## # … with 1,694 more rows
Gap_minder%>%
mutate(era=dplyr::if_else(year<2000,"20th","21st"))
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap era
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 20th
## 2 Afghanistan Asia 1957 30.3 9240934 821. 20th
## 3 Afghanistan Asia 1962 32.0 10267083 853. 20th
## 4 Afghanistan Asia 1967 34.0 11537966 836. 20th
## 5 Afghanistan Asia 1972 36.1 13079460 740. 20th
## 6 Afghanistan Asia 1977 38.4 14880372 786. 20th
## 7 Afghanistan Asia 1982 39.9 12881816 978. 20th
## 8 Afghanistan Asia 1987 40.8 13867957 852. 20th
## 9 Afghanistan Asia 1992 41.7 16317921 649. 20th
## 10 Afghanistan Asia 1997 41.8 22227415 635. 20th
## # … with 1,694 more rows
Decades<-Gap_minder%>%
mutate(era=dplyr::case_when(year<1960 ~ "50s",
year<1970 ~ "60s",
year<1980 ~ "70s",
year<1990 ~ "80s",
year<2000 ~ "90s",
year<2010 ~ "00s"))
ggplot(Decades,aes(x=era,y=lifeExp))+
ggtitle("Life Expectancy by Decade")+
geom_boxplot()+
# alpha is a fixed value, so it goes outside aes()
ggplot2::geom_jitter(aes(colour=continent),alpha=0.3,
show.legend = FALSE)
# Check out mutate in the tidyverse ref manual for more useful functions such as
# lead and lag to grab the next or previous data point in a column or the cumulative
# series of functions
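As a rough sketch of the helpers mentioned in the comment above, the example below applies lag, lead, and cumsum to a three-row toy tibble (values invented). It also shows one thing to watch for with case_when: rows that match none of the conditions become NA unless a catch-all condition is added at the end.

```r
library(dplyr)

# Toy data invented for this example
toy <- tibble::tibble(year = c(1952, 1957, 1962), pop = c(10, 12, 15))

windowed <- toy %>%
  mutate(
    prev_pop      = dplyr::lag(pop),        # value from the previous row
    next_pop      = dplyr::lead(pop),       # value from the next row
    change        = pop - dplyr::lag(pop),  # difference from the previous row
    running_total = cumsum(pop)             # cumulative sum down the column
  )
print(windowed)

# case_when returns NA when no condition matches...
yrs <- c(1997, 2007, 2012)
no_default <- case_when(yrs < 2000 ~ "90s or earlier",
                        yrs < 2010 ~ "00s")

# ...unless a final TRUE condition supplies a fallback value
with_default <- case_when(yrs < 2000 ~ "90s or earlier",
                          yrs < 2010 ~ "00s",
                          TRUE ~ "later")
print(no_default)
print(with_default)
```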
Often, you want a number summarizing particular groupings of data, e.g. what was the population increase in each country over the period observed? You can get this with group_by, summarize, and mutate
#Arrange sorts in ascending order using the variable(s) listed. Use desc(variable) to do the opposite.
#Group_by creates groups based on the levels in the column(s) you specify; you can then use summarize or
#mutate to manipulate data within those groups. Mutate keeps the original rows and adds a column,
#while summarize makes a new, smaller tibble with one calculated value for each group.
population_increase<-Gap_minder%>%
dplyr::arrange(year)%>%
dplyr::group_by(country)%>%
dplyr::summarize(
pop_inc=dplyr::last(pop)-dplyr::first(pop),
continent=unique(continent))
population_increase
## # A tibble: 142 x 3
## country pop_inc continent
## <chr> <dbl> <chr>
## 1 Afghanistan 23464590 Asia
## 2 Albania 2317826 Europe
## 3 Algeria 24053691 Africa
## 4 Angola 8188381 Africa
## 5 Argentina 22424971 Americas
## 6 Australia 11742964 Oceania
## 7 Austria 1272011 Europe
## 8 Bahrain 588126 Asia
## 9 Bangladesh 103561480 Asia
## 10 Belgium 1661821 Europe
## # … with 132 more rows
# colour is set in the geom; ggplot() itself silently ignores it
ggplot(population_increase,aes(x=continent,y=pop_inc,fill=continent))+
ggtitle("Population Change 1952 to 2007")+
geom_violin(colour="black")
#If you use mutate with group_by you add the values to
#the existing tibble.
Gap_minder%>%
dplyr::arrange(year)%>%
dplyr::group_by(country)%>%
mutate(pop_inc=dplyr::last(pop)-dplyr::first(pop))
## # A tibble: 1,704 x 7
## # Groups: country [142]
## country continent year lifeExp pop gdpPercap pop_inc
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 23464590
## 2 Albania Europe 1952 55.2 1282697 1601. 2317826
## 3 Algeria Africa 1952 43.1 9279525 2449. 24053691
## 4 Angola Africa 1952 30.0 4232095 3521. 8188381
## 5 Argentina Americas 1952 62.5 17876956 5911. 22424971
## 6 Australia Oceania 1952 69.1 8691212 10040. 11742964
## 7 Austria Europe 1952 66.8 6927772 6137. 1272011
## 8 Bahrain Asia 1952 50.9 120447 9867. 588126
## 9 Bangladesh Asia 1952 37.5 46886859 684. 103561480
## 10 Belgium Europe 1952 68 8730405 8343. 1661821
## # … with 1,694 more rows
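Note the "Groups: country [142]" line in the output: after a grouped mutate the tibble stays grouped, so later steps would still run per country. A minimal sketch with invented toy data shows how ungroup() removes the grouping once you are done with it.

```r
library(dplyr)

# Toy data invented for this example: two countries, two years each
toy <- tibble::tibble(
  country = c("A", "A", "B", "B"),
  year    = c(1952, 2007, 1952, 2007),
  pop     = c(10, 15, 20, 22)
)

with_inc <- toy %>%
  arrange(year) %>%
  group_by(country) %>%
  mutate(pop_inc = last(pop) - first(pop)) %>%
  ungroup()  # drop the grouping so later steps use the whole tibble

print(with_inc)
print(is_grouped_df(with_inc))
```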
#You can group_by multiple variables
median_pop<-Gap_minder%>%
dplyr::group_by(year, continent)%>%
dplyr::summarize(med_pop=median(pop))
ggplot(median_pop,aes(x=year,y=med_pop,fill=continent,colour=continent))+
ggtitle("Median Population by Year & Continent")+
geom_point()+
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
You can use joins to merge two tibbles along shared column(s)
OCH_tibble
## # A tibble: 3 x 3
## names r_abil height
## <chr> <dbl> <dbl>
## 1 Claire 10 165.
## 2 Cory 5 188.
## 3 Rachael 8 159.
food_tibble<-tibble::tibble(
names=c("Claire","Rachael","Claus","Eduardo","Dariya"),
r_abil=c(10,8,11,7,7),
fave_fruit=c("grapes","apples","kiwi","plum","tomato")
)
#Keep all data from left tibble
OCH_tibble%>%
dplyr::left_join(food_tibble)
## Joining, by = c("names", "r_abil")
## # A tibble: 3 x 4
## names r_abil height fave_fruit
## <chr> <dbl> <dbl> <chr>
## 1 Claire 10 165. grapes
## 2 Cory 5 188. <NA>
## 3 Rachael 8 159. apples
knitr::include_graphics("animated-left-join.gif")
#Keep all data from right tibble
right_join(OCH_tibble,food_tibble)
## Joining, by = c("names", "r_abil")
## # A tibble: 5 x 4
## names r_abil height fave_fruit
## <chr> <dbl> <dbl> <chr>
## 1 Claire 10 165. grapes
## 2 Rachael 8 159. apples
## 3 Claus 11 NA kiwi
## 4 Eduardo 7 NA plum
## 5 Dariya 7 NA tomato
knitr::include_graphics("animated-right-join.gif")
#Keep all data
full_join(OCH_tibble,food_tibble)
## Joining, by = c("names", "r_abil")
## # A tibble: 6 x 4
## names r_abil height fave_fruit
## <chr> <dbl> <dbl> <chr>
## 1 Claire 10 165. grapes
## 2 Cory 5 188. <NA>
## 3 Rachael 8 159. apples
## 4 Claus 11 NA kiwi
## 5 Eduardo 7 NA plum
## 6 Dariya 7 NA tomato
knitr::include_graphics("animated-full-join.gif")
#Keep only rows that have a match in both tibbles
inner_join(OCH_tibble,food_tibble)
## Joining, by = c("names", "r_abil")
## # A tibble: 2 x 4
## names r_abil height fave_fruit
## <chr> <dbl> <dbl> <chr>
## 1 Claire 10 165. grapes
## 2 Rachael 8 159. apples
knitr::include_graphics("animated-inner-join.gif")
#Keep only rows from the left tibble that have a match in the right tibble (no columns are added)
semi_join(OCH_tibble,food_tibble)
## Joining, by = c("names", "r_abil")
## # A tibble: 2 x 3
## names r_abil height
## <chr> <dbl> <dbl>
## 1 Claire 10 165.
## 2 Rachael 8 159.
knitr::include_graphics("animated-semi-join.gif")
#Keep rows from the left tibble that have no match in the right tibble
anti_join(OCH_tibble,food_tibble)
## Joining, by = c("names", "r_abil")
## # A tibble: 1 x 3
## names r_abil height
## <chr> <dbl> <dbl>
## 1 Cory 5 188.
knitr::include_graphics("animated-anti-join.gif")
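In all of the joins above, dplyr picked the join columns automatically: every column name the two tibbles share, here names and r_abil. You can name the key yourself with the by argument. The sketch below rebuilds the two tibbles so it runs on its own and joins on names only; the shared non-key column r_abil then appears twice, with the default .x and .y suffixes.

```r
library(dplyr)

OCH_tibble <- tibble::tibble(
  names  = c("Claire", "Cory", "Rachael"),
  r_abil = c(10, 5, 8),
  height = c(165.1, 187.96, 158.75)
)
food_tibble <- tibble::tibble(
  names      = c("Claire", "Rachael", "Claus", "Eduardo", "Dariya"),
  r_abil     = c(10, 8, 11, 7, 7),
  fave_fruit = c("grapes", "apples", "kiwi", "plum", "tomato")
)

# Join on names only; r_abil from each side is kept as r_abil.x / r_abil.y
joined <- left_join(OCH_tibble, food_tibble, by = "names")
print(colnames(joined))
print(joined)
```

Being explicit about by also silences the "Joining, by = ..." message and guards against accidentally joining on a column that just happens to share a name.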
Pivot_wider and pivot_longer (formerly spread and gather) are helpful for reshaping data so it can be processed or visualized more easily.
#Pivot_wider and pivot_longer replaced spread
#and gather in tidyr 1.0.0.
#Code using the old functions is commented
#out below alongside the new functions
#pivot_wider (formerly spread) makes data wider
Gap_minder%>%
select(country, pop, year)%>%
tidyr::pivot_wider(names_from = year,
values_from = pop)
## # A tibble: 142 x 13
## country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan… 8.43e6 9.24e6 1.03e7 1.15e7 1.31e7 1.49e7 1.29e7 1.39e7 1.63e7 2.22e7
## 2 Albania 1.28e6 1.48e6 1.73e6 1.98e6 2.26e6 2.51e6 2.78e6 3.08e6 3.33e6 3.43e6
## 3 Algeria 9.28e6 1.03e7 1.10e7 1.28e7 1.48e7 1.72e7 2.00e7 2.33e7 2.63e7 2.91e7
## 4 Angola 4.23e6 4.56e6 4.83e6 5.25e6 5.89e6 6.16e6 7.02e6 7.87e6 8.74e6 9.88e6
## 5 Argent… 1.79e7 1.96e7 2.13e7 2.29e7 2.48e7 2.70e7 2.93e7 3.16e7 3.40e7 3.62e7
## 6 Austra… 8.69e6 9.71e6 1.08e7 1.19e7 1.32e7 1.41e7 1.52e7 1.63e7 1.75e7 1.86e7
## 7 Austria 6.93e6 6.97e6 7.13e6 7.38e6 7.54e6 7.57e6 7.57e6 7.58e6 7.91e6 8.07e6
## 8 Bahrain 1.20e5 1.39e5 1.72e5 2.02e5 2.31e5 2.97e5 3.78e5 4.55e5 5.29e5 5.99e5
## 9 Bangla… 4.69e7 5.14e7 5.68e7 6.28e7 7.08e7 8.04e7 9.31e7 1.04e8 1.14e8 1.23e8
## 10 Belgium 8.73e6 8.99e6 9.22e6 9.56e6 9.71e6 9.82e6 9.86e6 9.87e6 1.00e7 1.02e7
## # … with 132 more rows, and 2 more variables: `2002` <dbl>, `2007` <dbl>
#tidyr::spread(key=year,value=pop)
#pivot_longer (formerly gather) makes data longer
Gap_minder%>%
select(country, pop, year)%>%
tidyr::pivot_wider(names_from = year,
values_from = pop)%>%
tidyr::pivot_longer(-country,
names_to="year",
values_to="pop"
)
## # A tibble: 1,704 x 3
## country year pop
## <chr> <chr> <dbl>
## 1 Afghanistan 1952 8425333
## 2 Afghanistan 1957 9240934
## 3 Afghanistan 1962 10267083
## 4 Afghanistan 1967 11537966
## 5 Afghanistan 1972 13079460
## 6 Afghanistan 1977 14880372
## 7 Afghanistan 1982 12881816
## 8 Afghanistan 1987 13867957
## 9 Afghanistan 1992 16317921
## 10 Afghanistan 1997 22227415
## # … with 1,694 more rows
# tidyr::spread(key=year,value=pop)%>%
# tidyr::gather(key=year,value=pop,-country)
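Notice that year comes back as a character column (<chr>) after pivot_longer, because it was built from column names. One way to handle this, sketched below on an invented two-country tibble, is to convert it back to a number with mutate.

```r
library(dplyr)
library(tidyr)

# Toy wide data invented for this example; year columns need backticks
wide <- tibble::tibble(
  country = c("A", "B"),
  `1952`  = c(10, 20),
  `2007`  = c(15, 22)
)

long <- wide %>%
  pivot_longer(-country, names_to = "year", values_to = "pop") %>%
  mutate(year = as.numeric(year))  # column names arrive as text

print(long)
```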
Unite is used to combine text from several columns into one, and separate is used to split one column back into several.
food_tibble%>%
tidyr::unite(name_foods,names,fave_fruit)
## # A tibble: 5 x 2
## name_foods r_abil
## <chr> <dbl>
## 1 Claire_grapes 10
## 2 Rachael_apples 8
## 3 Claus_kiwi 11
## 4 Eduardo_plum 7
## 5 Dariya_tomato 7
food_tibble%>%
tidyr::unite(name_foods,names,fave_fruit)%>%
tidyr::separate(name_foods,c("names","fave_fruit"))
## # A tibble: 5 x 3
## names fave_fruit r_abil
## <chr> <chr> <dbl>
## 1 Claire grapes 10
## 2 Rachael apples 8
## 3 Claus kiwi 11
## 4 Eduardo plum 7
## 5 Dariya tomato 7
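By default, separate splits on any run of non-alphanumeric characters, so a value containing a space would be split in the wrong place. The sketch below (with an invented name that contains a space) passes sep explicitly so only the underscore is treated as the separator.

```r
library(tidyr)

# Toy data invented for this example; note the space in "Mary Ann"
combined <- tibble::tibble(
  name_foods = c("Mary Ann_grapes", "Claus_kiwi")
)

# Split only on the underscore, not on the space
split_back <- combined %>%
  separate(name_foods, c("names", "fave_fruit"), sep = "_")

print(split_back)
```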
The stringr package has functions for processing text. Many of these take advantage of regular expressions (regex), which can be used to match complex patterns in a string variable.
#str_detect can be used to filter for a pattern. It returns a
#logical value (TRUE/FALSE)
Gap_minder%>%
filter(year==1992,
stringr::str_detect(country,"Al"))
## # A tibble: 2 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Albania Europe 1992 71.6 3326498 2497.
## 2 Algeria Africa 1992 67.7 26298373 5023.
#str_extract returns matches from a string
Gap_minder%>%
mutate(short_cont=
stringr::str_extract(continent,"[a-zA-Z]{3}")%>%
toupper())
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap short_cont
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Afghanistan Asia 1952 28.8 8425333 779. ASI
## 2 Afghanistan Asia 1957 30.3 9240934 821. ASI
## 3 Afghanistan Asia 1962 32.0 10267083 853. ASI
## 4 Afghanistan Asia 1967 34.0 11537966 836. ASI
## 5 Afghanistan Asia 1972 36.1 13079460 740. ASI
## 6 Afghanistan Asia 1977 38.4 14880372 786. ASI
## 7 Afghanistan Asia 1982 39.9 12881816 978. ASI
## 8 Afghanistan Asia 1987 40.8 13867957 852. ASI
## 9 Afghanistan Asia 1992 41.7 16317921 649. ASI
## 10 Afghanistan Asia 1997 41.8 22227415 635. ASI
## # … with 1,694 more rows
#str_replace_all can be used to swap one pattern for another
Gap_minder%>%
mutate(new_country=
stringr::str_replace_all(country,"[aeiou]","-"))
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap new_country
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Afghanistan Asia 1952 28.8 8425333 779. Afgh-n-st-n
## 2 Afghanistan Asia 1957 30.3 9240934 821. Afgh-n-st-n
## 3 Afghanistan Asia 1962 32.0 10267083 853. Afgh-n-st-n
## 4 Afghanistan Asia 1967 34.0 11537966 836. Afgh-n-st-n
## 5 Afghanistan Asia 1972 36.1 13079460 740. Afgh-n-st-n
## 6 Afghanistan Asia 1977 38.4 14880372 786. Afgh-n-st-n
## 7 Afghanistan Asia 1982 39.9 12881816 978. Afgh-n-st-n
## 8 Afghanistan Asia 1987 40.8 13867957 852. Afgh-n-st-n
## 9 Afghanistan Asia 1992 41.7 16317921 649. Afgh-n-st-n
## 10 Afghanistan Asia 1997 41.8 22227415 635. Afgh-n-st-n
## # … with 1,694 more rows
#str_detect can also be used to conditionally mutate
labelled<-Gap_minder%>%
filter(year==1992 & continent=="Europe")%>%
mutate(country_label=if_else(
stringr::str_detect(country,"Albania|France"),
country,""))
# colour is set in the geom; ggplot() itself silently ignores it
ggplot(data=labelled,aes(x=lifeExp,y=gdpPercap,fill=continent))+
geom_point(colour="blue")+
ggtitle("European Countries Per Capita GDP 1992")+
geom_text_repel(aes(x=lifeExp,y=gdpPercap,label=country_label))+
xlim(c(65,80))+
ylim(c(2000,35000))
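The patterns used with str_detect above are regular expressions, so a few extra symbols change what they match. The sketch below, on a short vector of country names chosen for this example, shows anchoring a pattern to the start (^) or end ($) of a string and ignoring case with regex().

```r
library(stringr)

countries <- c("Albania", "Algeria", "Malta", "Senegal")

# ^ anchors the pattern to the start of the string
starts_al <- str_detect(countries, "^Al")

# $ anchors the pattern to the end of the string
ends_al <- str_detect(countries, "al$")

# Matching is case-sensitive unless told otherwise
any_case <- str_detect(countries, regex("al", ignore_case = TRUE))

print(starts_al)
print(ends_al)
print(any_case)
```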
Factors are text-based data that typically involve multiple observations of the same text. R stores factors as integers, and this sometimes complicates figures and data analysis. Forcats provides functions to handle factors.
#era is a character column here, so ggplot treats it like a factor with
#alphabetical levels, and the order is off because R sorts 00s before 50s
ggplot(Decades,aes(x=era,y=lifeExp))+
ggtitle("Life Expectancy by Decade")+
geom_boxplot()+
# alpha is a fixed value, so it goes outside aes()
ggplot2::geom_jitter(aes(colour=continent),alpha=0.3,
show.legend = FALSE)
#We can reorder the variable using fct_reorder!
Decades<-Decades%>%
mutate(era=fct_reorder(era,year))
#You can also do the same manually with fct_relevel
# Decades<-Decades%>%
# mutate(era=fct_relevel(era,"00s",after=Inf))
ggplot(Decades,aes(x=era,y=lifeExp))+
ggtitle("Life Expectancy by Decade (Corrected)")+
geom_boxplot()+
# alpha is a fixed value, so it goes outside aes()
ggplot2::geom_jitter(aes(colour=continent),alpha=0.3,
show.legend = FALSE)
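To see what the reordering does without drawing a plot, you can inspect the factor levels directly. The sketch below uses short invented vectors rather than the Decades tibble: factor() sorts levels alphabetically, fct_reorder orders them by another variable (using the median of that variable within each level by default), and fct_relevel moves a named level by hand.

```r
library(forcats)

# Toy vectors invented for this example
era  <- c("50s", "00s", "90s", "50s")
year <- c(1952, 2002, 1997, 1957)

# Default: levels are sorted alphabetically, so "00s" comes first
default_levels <- levels(factor(era))

# Order levels by the median year within each era
by_year <- levels(fct_reorder(era, year))

# Or move "00s" to the end manually
by_hand <- levels(fct_relevel(era, "00s", after = Inf))

print(default_levels)
print(by_year)
print(by_hand)
```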
You can find a package to solve almost any problem, but here are a few that I commonly use in my workflow:
Science Related
Figure Making
Data input/processing
Here are some other links/tutorials that might be helpful